In [1]:
# limit GPU usage, if any, to this GPU
%env CUDA_VISIBLE_DEVICES=0
In [2]:
import numpy as np
from classifier import common
import os
labels = common.fetch_samples()
from sklearn.model_selection import train_test_split
np.random.seed(123)
y_train, y_test, sha256_train, sha256_test = train_test_split(
    list(labels.values()), list(labels.keys()), test_size=1000)
Clearly, my model needs a heavy dose of special fanciness. We'll turn to the ResNet architecture! This builds on the end-to-end model in the previous notebook, but stacks things higher and deeper, thanks to our hand-crafted residual cell.
This model may require a hefty GPU. I ran it successfully on a single TITAN X (Pascal). If yours doesn't work, try enabling the lite=True option in create_model.
You can find code that defines the MalwaResNet model architecture at classifier/malwaresnet.py.
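The exact layer definitions live in classifier/malwaresnet.py, but to give a flavor of what the hand-crafted residual cell is doing, here's a minimal sketch of a 1-D residual block in Keras. The layer widths and names below are illustrative only, not what create_model actually uses:
from keras.layers import Conv1D, BatchNormalization, Activation, add

def residual_block_1d(x, filters, kernel_size=3):
    # two conv / batch-norm / ReLU stages whose output is added back onto
    # the input (the "shortcut"); assumes x already has `filters` channels
    shortcut = x
    y = Conv1D(filters, kernel_size, padding='same')(x)
    y = BatchNormalization()(y)
    y = Activation('relu')(y)
    y = Conv1D(filters, kernel_size, padding='same')(y)
    y = BatchNormalization()(y)
    y = add([shortcut, y])  # the residual connection that lets us stack deeper
    return Activation('relu')(y)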
In [3]:
# for this demo, we'll slurp in only the first 256K (2**18) bytes of each file
max_file_length = int(2**18)
file_chunks = 16  # break each file into this many chunks
file_chunk_size = max_file_length // file_chunks  # 16K (2**14) bytes per chunk
batch_size = 4
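Reading and chunking each file is handled by common.get_file_data (see classifier/common.py). The rough idea, sketched here as a hypothetical standalone helper rather than the repo's actual API, is to zero-pad the first max_file_length bytes and reshape them into file_chunks rows of file_chunk_size byte values:
def bytes_to_chunks(path, max_file_length, file_chunks, file_chunk_size):
    # read at most max_file_length bytes, zero-pad to full length,
    # then reshape into a (file_chunks, file_chunk_size) array of byte values
    with open(path, 'rb') as f:
        raw = f.read(max_file_length)
    buf = np.zeros(max_file_length, dtype=np.uint8)
    buf[:len(raw)] = np.frombuffer(raw, dtype=np.uint8)
    return buf.reshape((file_chunks, file_chunk_size))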
In [ ]:
# Note: this is a very long-running cell, and
# the output may appear truncated before training completes.
# let's train this puppy
from classifier import malwaresnet
import math
from keras.callbacks import LearningRateScheduler, EarlyStopping, ModelCheckpoint
# create_model(input_shape, byte_embedding_size=2, lite=False)
model_malwaresnet = malwaresnet.create_model(input_shape=(file_chunks, file_chunk_size), byte_embedding_size=2)
train_generator = common.generator(list(zip(sha256_train, y_train)), batch_size, file_chunks, file_chunk_size)
test_generator = common.generator(list(zip(sha256_test, y_test)), 1, file_chunks, file_chunk_size)
training_history = model_malwaresnet.fit_generator(
    train_generator,
    steps_per_epoch=math.ceil(len(sha256_train) / batch_size),
    epochs=20,
    callbacks=[
        EarlyStopping(patience=1),
        ModelCheckpoint('malwaresnet.h5', save_best_only=True),
        LearningRateScheduler(lambda epoch: common.schedule(epoch, start=0.1, decay=0.5, every=1))],
    validation_data=test_generator,
    validation_steps=len(sha256_test))
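The learning rate schedule passed to LearningRateScheduler comes from classifier/common.py; I haven't reproduced it here, but a step-decay schedule consistent with the start/decay/every arguments above looks something like this (a sketch, not the repo's exact code):
def schedule(epoch, start=0.1, decay=0.5, every=1):
    # multiply the learning rate by `decay` every `every` epochs, starting from `start`:
    # with these defaults, epoch 0 -> 0.1, epoch 1 -> 0.05, epoch 2 -> 0.025, ...
    return start * (decay ** (epoch // every))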
Okay, no more snarkiness about impatient millennials from me. Each epoch is taking about 31,000 s ≈ 517 min ≈ 8 hours 36 minutes. But we're going to remain optimistic that this is going to be awesome. With a name like MalwaResNet, how could it not be?
In [4]:
from keras.models import load_model
# load the "best" model saved by our ModelCheckpoint callback
# (in this case, the penultimate model, which is not much better
# than the model we already have in hand)
model_malwaresnet = load_model('malwaresnet.h5')
y_pred = []
for sha256, lab in zip(sha256_test, y_test):
    y_pred.append(
        model_malwaresnet.predict_on_batch(
            np.asarray([common.get_file_data(sha256, lab, max_file_length)]).reshape(
                (-1, file_chunks, file_chunk_size))
        )
    )
common.summarize_performance(np.asarray(y_pred).flatten(), y_test, "MalwaResNet")
Out[4]:
(┛◉Д◉)┛彡┻━┻
For real, though. Even if MalwaResNet were a killer architecture for malware, it'd need to be trained on a lot more than a paltry 100K malicious/benign samples, and the optimization and architecture would need some revamping (for example, ResNet is trained over hundreds of epochs...on a metric ton of relatively small images). But as it stands, this special model isn't even close in performance to our not-so-special multilayer perceptron. (It's probable that if I let this train for another few weeks, we'd see some modest improvement...but I don't see it getting near 0.99+ AUC.)
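For reference, the AUC quoted here is the usual ROC AUC. common.summarize_performance is defined in classifier/common.py; the core of the number it reports boils down to something like the following (illustrative only, not the repo's exact code):
from sklearn.metrics import roc_auc_score

y_scores = np.asarray(y_pred).flatten()
print('MalwaResNet ROC AUC: {:.4f}'.format(roc_auc_score(y_test, y_scores)))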
Alright, the point of this exercise has been to demonstrate that